Examining the challenges of blood pressure estimation via photoplethysmogram

The use of observed wearable sensor data (e.g., photoplethysmograms [PPG]) to infer health measures (e.g., glucose level or blood pressure) is a very active area of research. Such technology can have a significant impact on health screening, chronic disease management and remote monitoring. A common approach is to collect sensor data and corresponding labels from a clinical grade device (e.g., blood pressure cuff) and train deep learning models to map one to the other. Although well intentioned, this approach often ignores a principled analysis of whether the input sensor data have enough information to predict the desired metric. We analyze the task of predicting blood pressure from PPG pulse wave analysis. Our review of the prior work reveals that many papers fall prey to data leakage and unrealistic constraints on the task and preprocessing steps. We propose a set of tools to help determine if the input signal in question (e.g., PPG) is indeed a good predictor of the desired label (e.g., blood pressure). Using our proposed tools, we found that blood pressure prediction using PPG has a high multi-valued mapping factor of 33.2% and low mutual information of 9.8%. In comparison, heart rate prediction using PPG, a well-established task, has a very low multi-valued mapping factor of 0.75% and high mutual information of 87.7%. We argue that these results provide a more realistic representation of the current progress toward the goal of wearable blood pressure measurement via PPG pulse wave analysis. For code, see our project page: https://github.com/lirus7/PPG-BP-Analysis

blood viscosity, and stiffness of blood vessels. Abnormally high or low blood pressure can result in heart attack, stroke, and diabetes 13,14; thus, it is recommended to measure BP frequently.
The methods to measure blood pressure non-invasively can be broadly categorized into two approaches: (i) The pulse transit time (PTT) method [15][16][17] is a popular, non-invasive technique for measuring blood pressure based on the time delay for a pressure wave to travel between proximal and distal arterial sites. The PTT approach has strong theoretical underpinnings based on the Bramwell-Hill equation 18, which relates PTT to pulse wave velocity and arterial compliance. The Wesseling model captures the relationship between arterial compliance and blood pressure 19. However, it is important to note that PTT can change independently of BP due to factors such as aging-induced arteriosclerosis and smooth muscle contraction. Hence, it needs to be calibrated from time to time. (ii) Pulse Wave Analysis (PWA) is a method used to estimate blood pressure (BP) by extracting features from an arterial waveform. This is typically performed using a photoplethysmography (PPG) waveform. PPG is an optical signal obtained by illuminating the skin (common sites are the finger, earlobe, or toe 20) with an LED and measuring the amount of transmitted, or reflected, light using a photodiode. PPG detects blood volume changes in the microvascular bed of tissue, as the blood volume directly impacts the amount of light transmitted/reflected. Unlike PTT, PWA has weaker theoretical underpinnings, as the small arteries interrogated by PPG are viscoelastic 15. Calibration is invariably necessary for PWA analysis methods to obtain reasonable results.
In this study, we concentrate on PWA measurement of BP. This method is beneficial because it only requires the use of a single sensor, making it a more accessible solution. Predicting BP by analyzing PPG waveforms is an active area of research 7,[21][22][23][24][25] and is already used in consumer products (https://www.samsung.com/global/galaxy/what-is/blood-pressure/). However, we should note that "while these methods (PTT and PWA) have been extensively studied and cuff-calibrated devices are now on the market, there is no compelling proof in the public domain indicating that they can accurately track intra-individual BP changes" 20,26. Therefore, although the features extracted from the PPG signal correlate with blood pressure, the signal's adequacy for accurately predicting blood pressure remains unclear.
The discrepancy between recent research [27][28][29] claiming promising results on evaluation benchmarks for blood pressure, and other observational studies 20,26 which indicate a lack of a concrete theory to measure blood pressure using PPG signals via PWA, raises important questions. To help resolve this apparent contradiction, we conduct a comprehensive examination of the existing PWA techniques in the literature (Table 1). Our analysis reveals that a significant portion of the prior papers contain one or more of four common pitfalls: (a) Data Leakage: where data samples from the same patient are present in both the train and test sets, (b) Overconstraining: where data far from the normal range is discarded as outliers, which statistically simplifies the task, (c) Unreasonable Calibration: where the calibration method is not tested over longer (e.g., > 1 day) time scales, and (d) Unrealistic Preprocessing: where a significant portion of the dataset is filtered out as noisy. We analyze these pitfalls in detail in our results section.
Our analysis reveals a somewhat surprising lack of improvement (modulo the pitfalls above) in PPG-based blood pressure prediction. This is in contrast to the substantive improvements in non-invasive prediction of other vitals such as heart rate during this time. This raises the question as to whether there is a limit/ceiling on the prediction accuracy. In order to answer this, we propose tools to examine whether an input sensor signal (x) (e.g., PPG) can be a good predictor of the output health label (y) (e.g., BP). For this, we want to evaluate whether an underlying function f exists, which captures the relationship between x and y, such that y = f(x). We also want to measure the conditioning of this underlying function and check whether it is well-conditioned, that is, whether small changes in x lead to small or large changes in y. It is important to ensure that (minor) noise in the sensor measurement (which is inevitable in a real-world setting) does not lead to significant error in the outputs. Our tool is based on information-theoretic notions of mutual information and multi-valued mappings. Using our proposed tool, we find that BP prediction using PPG has a high multi-valued mapping factor of 33.2% and low mutual information of 9.8%. In comparison, heart rate prediction using PPG, a well-established task, has a very low multi-valued mapping factor of 0.75% and high mutual information of 87.7%. This confirms that estimating BP from PPG is a challenging and ill-conditioned problem, and that a more principled approach is needed in the future for framing such health measure prediction tasks.

Results
In this section, we present a systematic review of prior work predicting BP via PPG PWA (Figure 1), followed by a principled analysis using our proposed tools.

Review of the results and limitations of prior work
To motivate our work, we analyzed recent research [21][22][23]27,29,34,[48][49][50][51] that reported results predicting BP via PPG PWA (see Table 1). These works relied on the MIMIC 52 dataset (Appendix C.1) containing continuous PPG signals and the corresponding arterial BP values. They evaluated their performance against the AAMI 53 and/or BHS 54 standards (Appendix C.2). We found that they fell prey to some common pitfalls, which resulted in misleading claims and over-optimistic results. For simplicity, we focus on the prediction of Systolic BP (SBP) rather than Diastolic BP (DBP), as SBP has a wider statistical range.
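The bias and error standard deviation figures quoted in this section can be computed as in the following minimal sketch. The function names are illustrative, and the default limits in `meets_aami` (absolute bias at most 5 mmHg, SD at most 8 mmHg, at least 85 subjects) are the commonly cited AAMI criteria, taken here as an assumption rather than from Appendix C.2. The sample (n - 1) standard deviation is also an assumption.

```python
import statistics


def error_stats(predicted, reference):
    """Return (bias, sd) of the errors predicted - reference, in mmHg."""
    errors = [p - r for p, r in zip(predicted, reference)]
    bias = statistics.mean(errors)
    sd = statistics.stdev(errors)  # sample SD (n - 1); an assumption
    return bias, sd


def meets_aami(bias, sd, n_subjects,
               max_bias=5.0, max_sd=8.0, min_subjects=85):
    """Check the assumed AAMI-style limits (defaults are assumptions)."""
    return (abs(bias) <= max_bias
            and sd <= max_sd
            and n_subjects >= min_subjects)
```

Under these assumed limits, a result with a bias of -1.19 mmHg and an SD of 8.01 mmHg would narrowly miss the SD criterion regardless of the number of test subjects.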
Before we begin, we should note that not all work (e.g., [35][36][37][38]50) followed the AAMI/BHS standards accurately. For example, some reported results on a test set of fewer than 85 subjects. Moreover, although these works use the same MIMIC dataset, we found a lack of standardization in the train-test data splits and different BP ranges used for evaluation (due to differences in how the data were filtered) across the literature 27,29. In the absence of official source code, it was difficult to reproduce prior results and compare different methods. Hence, we trained our own reference deep learning model (Figure 2), similar to the methods presented in prior research 27,34,49. The reference network takes a three-channel input consisting of the original PPG waveform, along with its first and second derivatives, and outputs the predicted SBP value. The model consists of an eight-layer residual CNN 55 with 1D convolutions, and is trained using a mean squared error loss. We also explored 2D convolution based CNN models, such as DenseNet-161 28 and ResNet-101 55, taking a spectrogram of the 1D PPG signal 27 and/or the raw waveform as input. Among these, we found that the 1D CNN based architecture performed best.
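The three-channel input described above can be sketched as follows. This is an illustration only: the finite-difference scheme and the function names are assumptions, since the text does not state how the derivatives were computed.

```python
def derivative(signal, dt):
    """Finite-difference derivative: central inside, one-sided at the ends."""
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples")
    out = [0.0] * n
    out[0] = (signal[1] - signal[0]) / dt
    out[-1] = (signal[-1] - signal[-2]) / dt
    for i in range(1, n - 1):
        out[i] = (signal[i + 1] - signal[i - 1]) / (2 * dt)
    return out


def three_channel_input(ppg, fs):
    """Stack PPG with its first (VPG) and second (APG) derivatives."""
    dt = 1.0 / fs
    vpg = derivative(ppg, dt)   # first derivative of the PPG
    apg = derivative(vpg, dt)   # second derivative of the PPG
    return [list(ppg), vpg, apg]
```

Differentiation amplifies high-frequency noise, so in practice the PPG would usually be band-pass filtered before this step.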

Data leakage
The goal of any machine learning model is to generalize well to test data that will be seen in real-world settings 56. Even with a large training set, it is very unlikely that identical samples to those seen in the training set will appear at test time; thus, generalization is crucial. Unfortunately, good performance on a training dataset does not always translate to good performance on a test set, as models can overfit. This is especially true for modern deep neural networks, which are highly over-parameterized and can easily memorize the training data 57. Thus, evaluating test performance accurately is an important step in understanding how a model will function in the real world. For this, the test data needs to be pristine, i.e., without any contamination from the training data. Unfortunately, contamination can and does happen in several ways.
We observed two types of overlap between training and testing splits (Figure 1A): data-overlap and domain-overlap.
Data-overlap corresponds to overlap of actual segments from a sample between the train and test sets. Domain-overlap is more subtle: although there is no direct overlap of samples, leakage may occur due to similarities between the train and test data. In our case, it corresponds to using different records from the same patient in both the test and train sets (Figure 3).
Here, we consider a particular example from the literature, PPG2ABP 21, where the authors propose a U-Net based architecture to predict the ABP (Arterial BP) waveform from PPG. They obtain impressive results with a bias of −1.19 mmHg and error standard deviation (SD) of 8.01 mmHg on the SBP prediction task (Table 2), which is close to the AAMI standard. (Note that there is an error in the computation of standard deviation in the PPG2ABP 21 evaluation script; we report the corrected results here.) However, while analyzing their source code, we found both data and domain overlaps.
Data-Overlap: The PPG2ABP 21 data processing pipeline divides each PPG record (∼6 mins long) into 10-second windows with an overlap of 5 seconds (URL: github.com/nibtehaz/PPG2ABP/blob/master/codes/data_processing.py) (Figure 3). Using overlapping windows helps, as it increases the size of the training data. However, the problem arises when these 10-second samples are randomly split into train and test sets. Since the overlapping windows are generated before the random train-test split, the train and test sets can have samples with the same overlapping regions (Figure 3). A deep learning model can memorize values based on these overlapping portions, leading to artificially high accuracy on the test set.
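The effect of windowing before splitting can be illustrated with a small sketch. It is not the PPG2ABP code. It only counts how many test windows share raw samples with at least one training window after a random 80-20 split, and the function names and the 125 Hz sampling rate used in the usage note are assumptions.

```python
import random


def make_windows(n_samples, window, step):
    """Return (start, end) index pairs of sliding windows over one record."""
    return [(s, s + window) for s in range(0, n_samples - window + 1, step)]


def shares_samples(a, b):
    """True if two half-open index ranges overlap."""
    return a[0] < b[1] and b[0] < a[1]


def leaked_fraction(windows, test_fraction=0.2, seed=0):
    """Fraction of test windows overlapping some training window."""
    rng = random.Random(seed)
    shuffled = windows[:]
    rng.shuffle(shuffled)          # random split made AFTER windowing
    n_test = max(1, round(test_fraction * len(shuffled)))
    test, train = shuffled[:n_test], shuffled[n_test:]
    leaked = sum(any(shares_samples(t, w) for w in train) for t in test)
    return leaked / len(test)
```

For a 6-minute record at an assumed 125 Hz, `make_windows(45000, 1250, 625)` yields 71 windows with a 5-second overlap, and some test windows are bound to share samples with training windows. With non-overlapping windows (`step == window`) the leaked fraction is zero, although domain-overlap can still remain.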
Domain-Overlap: Due to the physiological differences between individuals, person-dependent models often outperform person-independent models 58. For example, for the BP prediction task, a model can learn the normal range of an individual's BP and leverage that to provide more accurate predictions. Since the knowledge of an individual's identity can impact a model's accuracy, it is important that the identity of the subject is not leaked (even implicitly) between test and train sets, especially while building person-independent models. Since the PPG signature has been shown to identify an individual 59, the presence of PPG signals from the same individual in both train and test data can thus leak identity. This turns out to be the case in the PPG2ABP work 21, as they randomly split PPG records into test and train sets, resulting in different windows from the same patient being present in both test and train sets (Figure 3).
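A split that avoids domain-overlap assigns whole patients, rather than records or windows, to one side. The sketch below shows one way to do this. The function name and the use of per-window patient identifiers are illustrative assumptions.

```python
import random


def patient_level_split(patient_ids, test_fraction=0.2, seed=0):
    """Split window indices so that no patient appears in both sets.

    patient_ids[i] is the (hypothetical) patient label of window i.
    """
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test_patients = set(unique[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids)
                 if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids)
                if p in test_patients]
    return train_idx, test_idx
```

Because the 80-20 ratio is applied to patients, the ratio of windows can deviate from 80-20 when patients contribute different amounts of data.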
To quantitatively evaluate the impact of data leakage, we compare the performance of the PPG2ABP network on three splits (Figure 3): (1) No-overlap: the dataset is partitioned at the patient level with an 80-20% train-test split, (2) Domain-Overlap: each patient has multiple records (∼6 mins long), and these records are randomly split 80-20% between the train and test sets, i.e., records from the same patient can be present in both the training and test sets, and (3) Data-Overlap: we use the split provided by PPG2ABP 21, which divides the records into overlapping windows followed by an 80-20% train-test split. All splits consist of 10-second windows with an overlap of

Figure 1. (A) providing the model with observations from similar patients, (B) constraining the task (e.g., limiting the distribution of labels), (C) calibrating models using data from a participant, or (D) preprocessing to filter out problematic samples (e.g., noisy inputs). When doing so it can often be difficult to identify how these steps impact the integrity of a model.
Table 1. The table summarizes the limitations of previous research and indicates whether the study exhibits specific pitfalls. The pitfalls are categorized into four categories: a) Data-split: Domain Overlap (denoted as D.O), Data Overlap (denoted as C.O), or Small test set (denoted as S.T). b) Over-constraining: SBP values and standard deviation (if provided). c) Unrealistic Pre-Processing: % of remaining dataset after pre-processing (if provided). d) Calibration: Correctly employed and justified for longer periods. The columns denote the presence or absence of each limitation, with Y (yes) indicating that the study has the limitation, N (no) indicating that it does not have that limitation, U (unknown) indicating that there is not enough information available, and "-" indicating that the research is not applicable to the pitfall. For additional information, please see Section "Review of the results and limitations of prior work".

Figure 2. Our reference network is used to evaluate the impact on performance due to the issues mentioned in Section "Review of the results and limitations of prior work". The network has 28M trainable parameters, takes a 3-channel input (PPG, VPG, APG), and outputs the SBP prediction. The model is optimized using a mean squared error loss.

[Figure: train/test split diagram with patient groups P1, P3, ... and P2, P4, ...]

Overconstraining the task
Health-related data typically have non-uniform Gaussian distributions, with the highest data density near the "normal" (or healthy) range, and falling exponentially as we move away from the normal. We observe a similar trend for BP data in both the Aurora-BP 60 (Appendix C.1) and MIMIC datasets (see Figure 4). While points far from normal are rare, they are often crucial events (abnormally low or high BP) indicating serious health issues requiring medical attention.
However, we found that researchers often discard so-called "outliers" 22,27,29 (Figure 1B), arguing that such samples are unlikely or have occurred due to noise in the data collection process.For example, the MIMIC dataset has SBP values ranging between 65 and 200 mmHg (75-220 mmHg in Aurora-BP), but Schlesinger et al. 27 ignored samples outside the range of 75-165 mmHg, referring to the discarded values as "improbable".Similarly, Cao et al. 22 and Hill et al. 29 use a constrained range of 75-150 mmHg, while according to the British Hypertension Society literature, 140-159 mmHg is Grade-1 (mild) hypertension, 160-179 mmHg is Grade-2 (moderate) hypertension, and ≥180 mmHg is Grade-3 (severe) hypertension 54 .
Constraining the data range has two problems. First, it leads to an incomplete evaluation, as the model is neither trained nor tested on samples from the discarded ranges. Second, since the statistical range of the output is reduced, this makes the prediction task artificially "easier" (i.e., a lower error can be achieved more easily), which may result in promising but misleading results. To quantitatively study the impact of constraining data ranges, we conducted an experiment using our reference network with different filtering of the data range. Table 3 shows the performance of our network when trained with three different SBP ranges: 65-200, 75-165 and 75-150 mmHg. Even small restrictions in the output range can lead to a significant (perceived) improvement in accuracy, e.g., reducing the SBP upper limit from 165 to 150 mmHg results in an ∼11.4% improvement in the standard deviation. This can be explained by the fact that samples at the extremes often result in the highest prediction errors (as models tend to predict closer to the mean of the distribution, making predictions on samples with very high or low ground-truth BP values the most inaccurate).
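The statistical effect of narrowing the label range can be reproduced without any model. The sketch below uses synthetic Gaussian labels (an assumption for illustration, not MIMIC or Aurora-BP data) and a trivial predictor that always outputs the mean of the retained labels. All names and the distribution parameters are hypothetical.

```python
import random
import statistics


def synthetic_sbp(n, seed=0, mean=125.0, sd=22.0):
    """Synthetic SBP labels in mmHg; the parameters are illustrative only."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]


def error_sd_of_mean_predictor(labels, low, high):
    """Error SD of a constant (mean) predictor on labels within [low, high]."""
    kept = [y for y in labels if low <= y <= high]
    prediction = statistics.mean(kept)
    errors = [prediction - y for y in kept]
    return statistics.pstdev(errors)
```

Evaluating the three ranges used in Table 3 (65-200, 75-165 and 75-150 mmHg) on these synthetic labels gives a smaller error SD for each narrower range, even though the predictor has learned nothing about the input signal.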
Excluding samples with SBP measurements ≥165 mmHg or ≤75 mmHg during the training of machine learning models may result in overlooking crucial physiological features, potentially concealing serious health conditions and introducing bias into the model. This practice not only limits the scope of the developed models but also hinders conclusions about their generalizability and real-world applicability, as they become less representative of the diverse patient populations they are intended to serve.

Unreasonable calibration
The relationships between health measures (e.g., PPG and BP) are often person dependent. For example, blood pressure (bp) is dependent on the patient's heart rate (hr), blood viscosity (visc), stiffness of blood vessels (stif), etc., i.e., bp = f(hr, visc, stif, ...). While the PPG signal might capture heart rate well, it may not be able to capture viscosity- and stiffness-related information. To solve this problem, it is common to propose the use of a calibration step, wherein a few PPG samples from each patient along with gold-standard BP values are used to calibrate the function f for that patient (Figure 1C). The model then learns a calibrated function, f, for a specific patient, i.e., bp = f(hr), where the patient-specific parameters (visc, stif, ...) are folded into f. The literature does not offer a universally effective calibration strategy. Cao et al.'s 22 method needs to be calibrated every time before a BP prediction to find the optimal fit on the wrist for the watch, while Schlesinger et al.'s 27 model needs to be calibrated once to find the offset value between the model and the true prediction. As blood pressure may not change drastically within minutes (at rest) and significant trends might be observed only over the course of a few months owing to lifestyle changes or the influence of medication 61, it becomes important to pay attention to questions such as: What is the frequency of re-calibration? Is the calibration approach prone to changes in other environmental factors? We believe that the calibration approaches reported in prior work risk over-fitting by memorizing patient-level local temporal characteristics, and that evaluation is incomplete given that they do not evaluate BP prediction over longer time scales.
To understand the influence of calibration, we evaluate the prediction performance under different calibration strategies. Naïve Calibration simply predicts a constant calibrated value for the entire record. The constant value is computed as the mean of the ground truth values of the first three windows of a record. Offset Calibration uses our reference network, but adds an offset to the predicted value. The offset is computed in the calibration step as the difference between the predicted and ground truth BP of the test record's first window. We found the Naïve Calibration to perform very well (Table 4), with a standard deviation of 8.61 mmHg, close to the AAMI standard. However, predicting a constant BP value for a patient is clearly incorrect. This inconsistency underscores problems with the evaluation methodology. Since typical records in MIMIC have short time intervals (average length = 6 minutes) compared to the time scales at which BP changes, predicting a constant value gives deceptively good accuracy. An appropriate evaluation of calibration methods should consider time scales spanning the intended re-calibration duration. For example, if re-calibration is planned every six months, the method should be evaluated with patients tracked over at least a six-month time period. To demonstrate that calibration systems can quickly deteriorate over time, we analyzed the performance of Offset Calibration as the time from the calibration window increases. Although the method performs well for the first few days, the error rates increase dramatically after that (Figure 5A).
www.nature.com/scientificreports/
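The two calibration baselines described in this section can be written in a few lines. The sketch below is illustrative: the function names are hypothetical, and the sign convention of the offset (ground truth minus prediction, added to every model output) is an assumption.

```python
import statistics


def naive_calibration(true_sbp, n_calibration=3):
    """Predict one constant: the mean ground truth of the first windows."""
    constant = statistics.mean(true_sbp[:n_calibration])
    return [constant] * len(true_sbp)


def offset_calibration(model_sbp, true_sbp):
    """Shift model outputs by the error observed on the first window.

    Assumed convention: offset = ground truth - prediction, so the
    calibrated first window matches its ground truth exactly.
    """
    offset = true_sbp[0] - model_sbp[0]
    return [p + offset for p in model_sbp]
```

Both functions return predictions for every window, including the ones used for calibration. Whether those windows are excluded from the error statistics is a further evaluation choice that this sketch leaves open.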

Unrealistic preprocessing or filtering
The MIMIC dataset comprises ICU patient data, with artifacts due to patient movement, sensor degradation, transmission errors from bedside monitors, and human errors in post-processing data alignment. The impact of these artifacts is visible in both the PPG and ABP waveforms as missing data, noisy data, and sudden changes in amplitude and frequency (Figure 6). To clean the signal, researchers 27,29 have used band-pass filters to remove noise in the high frequency (≥16 Hz) and low frequency (≤0.5 Hz) ranges, followed by auto-correlation to filter signals that are not strongly correlated with themselves. The auto-correlation step removes samples with uneven amplitude and/or frequency. After cleaning the MIMIC dataset (Figure 1D), Schlesinger et al. 27 used less than 5% of the total data for training their neural network, while Hill et al. 29 and Slapnicar et al. 34 used less than 10% of the total MIMIC data. This suggests that "clean" data is rare. Although filtering datasets to remove some noise is often an essential step to train a machine learning model 56, excessive filtering of data can result in overfitting.
Models trained on such clean data might achieve high performance on a clean test set; however, they might fail in practice, as it is difficult to obtain such clean signals in a real-world scenario.
To understand the impact of filtering on a dataset, we measure the performance of our reference network at different auto-correlation thresholds. Figure 5B plots the performance of our reference network in predicting SBP and the percentage of filtered data for each auto-correlation threshold. The performance of the network improves by 29.7% and the dataset size decreases by 63% as we increase the auto-correlation threshold from 0 to 0.8.
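An auto-correlation filter of the kind discussed here can be sketched as follows. The exact criterion used in prior work is not given in the text, so this is an assumed variant: a window is scored by its largest normalized auto-correlation over lags that correspond to plausible beat periods. The heart-rate bounds and all function names are hypothetical.

```python
def autocorrelation(signal, lag):
    """Normalized (biased) auto-correlation of a mean-centered signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0  # flat signal: no periodic structure to score
    num = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return num / denom


def periodicity_score(signal, fs, min_hr=40, max_hr=180):
    """Best auto-correlation over lags spanning assumed heart-rate bounds."""
    min_lag = max(1, int(fs * 60 / max_hr))
    max_lag = min(int(fs * 60 / min_hr), len(signal) - 1)
    if min_lag > max_lag:
        return 0.0  # window too short to contain one beat period
    return max(autocorrelation(signal, lag)
               for lag in range(min_lag, max_lag + 1))


def keep_window(signal, fs, threshold):
    """Keep a window only if its periodicity score reaches the threshold."""
    return periodicity_score(signal, fs) >= threshold
```

Raising `threshold` toward 0.8 keeps only highly regular windows, which is the mechanism by which a dataset shrinks while measured accuracy improves.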

Our proposed principled approach
We propose and utilize two tools, based on multi-valued mappings and on mutual information (Appendix B), to estimate if the input signal is a good predictor of the output. Using our proposed tools, we performed a principled analysis to study the relationship between PPG and BP. For comparison, we also used our tools on heart rate (HR) and reflected wave arrival time (RWAT) estimation, for which it is known that the PPG signal is a strong predictor.
Checking for Multi-Valued Mappings: We use Algorithm 1 to find multi-valued mappings corresponding to data samples that are close in the input space but distant in the output space. As discussed in Section B.1, to compute the distance between two PPG inputs, we first align them using cross-correlation, followed by computing their Euclidean distance. We divide the dataset records into non-overlapping two-second windows and treat them as individual inputs. We set an input distance threshold of 1.0, which corresponds to a per-time-sample threshold of 4e-3 (each 2s PPG window had 250 samples). For the output, we set thresholds of 8 mmHg, 8 bps, and 0.02s for the BP, HR and RWAT prediction tasks, respectively. We found very few multi-valued mappings for the HR and RWAT tasks, but a large number for the SBP task (Table 5). In the MIMIC dataset, for 33.2% of the 2-second windows, we found another window from the same patient that was close in the input PPG space but had a significantly different SBP output. When limiting the search to different patients, for 15.0% of the windows we could still find such matches. This implies that the task of predicting BP from PPG is ill-conditioned. Figure 7 shows examples of such multi-valued mappings, with highly similar input PPG waveforms but significantly different output arterial BP waveforms. In comparison, for the HR and RWAT tasks, the number of such matches is much smaller at 0.02% and 0.08% intra-patient, respectively, suggesting much better conditioning.
In the process of filtering multi-valued mappings, it is essential to consider the specificity of sensors and the methodologies employed in preprocessing the input data. Our analysis focuses on intra-patient and inter-patient multi-valued mappings within specific datasets, namely MIMIC and Aurora-BP, rather than across different datasets. This approach ensures that our findings are not confounded by variations in sensor quality or the nuances of measurement techniques. Additionally, it enables us to apply preprocessing steps that preserve amplitude information.
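The multi-valued mapping check can be sketched as below. This is not Algorithm 1 itself, which is not reproduced here, but a minimal illustration of the idea: align two windows by the shift that maximizes their cross-correlation, measure the Euclidean distance, and flag pairs that are close in input but far apart in output. The circular shift, the `max_shift` parameter, the strict comparisons, and the function names are assumptions.

```python
import math


def best_shift(a, b, max_shift):
    """Circular shift of b (|shift| <= max_shift) maximizing cross-correlation."""
    n = len(a)

    def xcorr(s):
        return sum(a[i] * b[(i + s) % n] for i in range(n))

    return max(range(-max_shift, max_shift + 1), key=xcorr)


def aligned_distance(a, b, max_shift):
    """Euclidean distance between a and the best-aligned version of b."""
    n = len(a)
    s = best_shift(a, b, max_shift)
    return math.sqrt(sum((a[i] - b[(i + s) % n]) ** 2 for i in range(n)))


def multi_valued_fraction(inputs, outputs, input_threshold,
                          output_threshold, max_shift):
    """Fraction of windows having a close-input, distant-output partner."""
    n = len(inputs)
    flagged = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            if abs(outputs[i] - outputs[j]) <= output_threshold:
                continue  # outputs agree: not a multi-valued mapping
            d = aligned_distance(inputs[i], inputs[j], max_shift)
            if d < input_threshold:
                flagged[i] = flagged[j] = True
    return sum(flagged) / n
```

With the thresholds quoted above, a call for the SBP task would pass `input_threshold=1.0` and `output_threshold=8` (mmHg). The pairwise loop is quadratic in the number of windows, so a large dataset would need an index or a restriction to windows from the same patient.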
Evaluating Mutual Information: To estimate mutual information (MI) between the PPG signal and the target output (BP/HR/RWAT), we use the K-nearest neighbours based approach proposed by Kraskov et al. 62. We leverage dimensionality reduction to make MI estimation tractable, using handcrafted and auto-encoder learned feature representations. We report the mutual information of the input features and target variable, as well as the entropy of the target variable. Note that the target variable's entropy is the maximum achievable mutual information. Thus, the ratio of MI and target variable entropy represents the target information fraction encoded by the input, which we call Info-Fraction. We found Info-Fraction to be a more intuitive measure than the absolute MI values, and use it to compare the predictive power of PPG across the different tasks.
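The definition of Info-Fraction can be illustrated on discrete data. Note that the sketch below swaps in a simple plug-in (histogram) estimator over already-discretized values in place of the K-nearest-neighbours estimator of Kraskov et al., which is what handles continuous, multi-dimensional features. Only the ratio MI / H(target) is the point of the example, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Plug-in Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(values).values())


def mutual_information(x, y):
    """Plug-in MI via I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def info_fraction(x, y):
    """Fraction of the target entropy H(Y) captured by the input X."""
    h_y = entropy(y)
    if h_y == 0:
        return 0.0  # a constant target carries no information to capture
    return mutual_information(x, y) / h_y
```

As a sanity check, an input identical to the target gives an Info-Fraction of 1, a constant input gives 0, and an input that resolves one of two equally informative bits of the target gives 0.5.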
Handcrafted Features: As suggested by Takazawa 63 and Elgendi et al. 64, we calculate handcrafted features (see Table 6) from the PPG waveform (Figure 8). Due to the absence of a time-aligned ECG waveform in the MIMIC dataset, we extracted the relevant handcrafted features only from the PPG waveform. Table 7 presents the MI of these individual features with respect to the BP prediction task for both the MIMIC and Aurora-BP datasets, along with the MI when all these features are combined and regarded as a single multi-dimensional input. We found that even the combined feature set encodes a small fraction of the total target entropy. For example, in the MIMIC dataset, the combined features' Info-Fraction is just 9.5%, while heart rate itself contributes an Info-Fraction of 4.1%. Similar observations hold true for the Aurora-BP dataset. This hints that the PPG signal does not have enough information to predict BP in this dataset, and moreover the prediction is highly dependent on the heart rate.
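Two of the handcrafted features in Table 6 follow directly from beat timings. The sketch below assumes that beat (pulse peak) times in seconds have already been detected, which is outside its scope. The function names are illustrative, and the SDNN here is computed within a single window, leaving out the averaging across windows mentioned in Table 6.

```python
import statistics


def beat_intervals(beat_times):
    """Successive inter-beat intervals in seconds."""
    return [b - a for a, b in zip(beat_times, beat_times[1:])]


def heart_rate_bpm(beat_times):
    """Heart rate as the inverse of the median inter-beat interval."""
    return 60.0 / statistics.median(beat_intervals(beat_times))


def sdnn_seconds(beat_times):
    """SD of inter-beat (NN) intervals within one window."""
    return statistics.stdev(beat_intervals(beat_times))
```

Using the median interval makes the heart-rate estimate robust to an occasional missed or spurious beat, whereas the SDNN remains sensitive to such detection errors.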
For the Aurora-BP dataset we have the demographic data (age, weight, height) of the subjects, as well as time-aligned PPG and ECG waveforms. This allows us to calculate additional features, e.g., radial Pulse Arrival Time (rPAT) and other derived features 60. Prior work 7 has used PAT to estimate blood pressure. Moreover, the Aurora-BP dataset has multiple readings for each subject in different positions (e.g., sitting, at rest, and supinated), which helps us add delta features reflecting the difference between features in the two conditions. Despite this, we found the entropy results for the Aurora-BP dataset to be similar to the MIMIC dataset, with the handcrafted features able to capture only 9.8% of the entropy of blood pressure (Table 8). On the other hand, for the HR and RWAT prediction tasks, the handcrafted features captured 87.7% and 64.6% of the entropy, respectively (ground truth for HR is derived from the ECG sensor data and RWAT from the tonometric sensor data). This further strengthens our finding that the PPG signal, even with additional information from the ECG waveform, has limited information to predict BP.

Table 5. Multi-valued mapping matches for the BP, HR and RWAT prediction tasks. For the BP task, there was a high match rate both within the same patient's records and across patients, suggesting an ill-conditioned problem. For the HR and RWAT tasks, the matches were much lower. Ground truth for RWAT is only available for the Aurora-BP dataset.

Figure 7. Examples of PPG waveforms (PPG_i and PPG_j) that are very similar and have corresponding arterial blood pressure waveforms (ABP_i and ABP_j) that are quite different. This highlights the existence of similar features that map to different targets, which makes the task of blood pressure prediction via PPG pulse wave analysis ill-conditioned.

Table 6. Descriptions of the handcrafted features used for the Mutual Information analyses.

Feature: Description
Heart Rate (HR): Measurement of the number of pulsations of the heart in a minute. Calculated as the inverse of the median time between each heart beat. The PPG signal was used for MIMIC (because time alignment with the ECG signal was not precise); the ECG signal was used for Aurora-BP.
Heart Rate Variability (HRV): Measurement of the variation in time between each heart beat. Calculated as the mean of standard deviations of normal-normal (NN) intervals (SDNN).
Quality: Measurement of the quality of the PPG signal. A heuristic-based algorithm that takes the signal-to-noise ratio, artifacts, and consistency between the pulses in a window into consideration and computes a normalized score between 0 and 1.
dp/dt: Measurement of the mean systolic rise times normalized with respect to the duration of each beat in the PPG signal.
rPAT: Measurement of the delay between the R-peak in the ECG signal and the systolic peak of the PPG signal. This can only be computed for Aurora-BP due to imprecise synchronization in MIMIC.
Δ Feature: Measured as the difference between the calculated value and baseline value. A baseline value of each feature for all patients is computed in Aurora-BP (not available for MIMIC).
std. Feature: Measures the fluctuation of a feature across a fixed time period.

Auto-encoder Features: As an alternative to handcrafted features, we train an auto-encoder on the raw PPG waveform to obtain a set of low-dimensional features. We use a five-layer multilayer perceptron (MLP) auto-encoder with ReLU activation and a bottleneck layer of 20 neurons. The model was trained with the Adam optimizer (learning rate of 0.001) and a mean-squared error loss (with a stopping point when the loss saturated at <0.1). Training time on a single NVIDIA P100 was under an hour. Table 9 shows the MI of the combined bottleneck features with respect to the BP, HR and RWAT prediction tasks. Although the auto-encoder features are more comprehensive and have higher MI compared to the handcrafted features, the Info-Fraction for BP prediction (12.9% for MIMIC and 8.7% for Aurora-BP) is still much lower compared to that for the HR (92.2% for MIMIC and 93.1% for Aurora-BP) and RWAT (70.1% for Aurora-BP) prediction tasks.
There are two possible implications of these findings. First, they may suggest that PPG signals lack adequate information for accurate BP prediction. Alternatively, they could imply a limitation in the current sensor technology, making sensors susceptible to confounding factors like external noise and environmental variations, thereby hindering the accuracy of BP prediction.

Conclusion
Our results reveal that BP prediction via pulse wave analysis of the PPG signal is still an unsolved task and far from the acceptable AAMI and BHS standards. By performing a systematic review and accompanying experiments, we found several issues being overlooked in the prior work that have led to seemingly over-optimistic results. These pitfalls can be categorized into data splits that leak information from test samples into the training set, heavy constraints on the task that remove challenging samples and reduce the range of target values substantially, calibration methods that seem to be practically problematic, and unreasonable preprocessing that filters the data to an unrealistic extent such that any noise is unacceptable. These pitfalls simplify the machine learning task, creating a deceptive perception of ease in model training, which results in inflated performance. Ultimately, this translates to models that overfit the training data, hindering their ability to generalize effectively and handle real-world data variations.
Table 7. Mutual Information of PPG optical features in the BP prediction task. Even all features combined have a small Info-Fraction, and most of that is captured by the heart rate feature alone.

While research on non-invasive approaches to estimate health vitals such as heart rate and blood oxygen saturation has made tremendous progress, enabling these technologies to become ubiquitous in the last decade, progress in non-invasive cuffless BP estimation has been slow despite witnessing similar research interest. This has prompted us to question whether the problem itself is ill-conditioned and whether the PPG signal contains enough information to predict BP in the first place. To answer these questions, we have proposed a set of tools based on multi-valued mapping and mutual information to check if an input signal is a good predictor of the desired output. The multi-valued mapping checker allows us to find samples close in input space but far in output space. We found many such samples in both the MIMIC and Aurora-BP datasets. Searching for multi-valued mappings was trivial once an appropriate distance metric and thresholds were defined; qualitative and quantitative results show that almost identical PPG waveforms can have very different BP waveforms. Next, we looked at the entropy of the features by computing mutual information. MI was extremely low for both hand-crafted and learned auto-encoder features. In comparison, heart rate and RWAT prediction tasks from PPG PWA have much lower multi-valued mapping factors and much higher mutual information, indicating that these tasks are relatively well conditioned compared to PPG PWA to BP. We believe that these tools are relevant for feasibility analysis in similar tasks involving wearable data, such as predicting stress levels from PPG [65][66][67] and estimating blood glucose levels from PPG [68][69][70].
Our study does not aim to prove that blood pressure estimation from PPG PWA is impossible; however, it indicates that the task is very challenging, and evaluating performance fairly is non-trivial. To navigate this complexity, we present a set of tools that future research can leverage to avoid the pitfalls identified here. We hope our work can serve as a milestone and stimulate further discussion and exploration in the following areas: (1) Data Diversity: Collecting comprehensive datasets that represent subjects from diverse demographics and

Table 8. Mutual Information of patient demographic data, PPG optical features and features derived using ECG, for the Aurora-BP dataset 60. While all features combined have an Info-Fraction of just 9.8% for the SBP prediction task, they encode much more information for the HR prediction (87.7%) and RWAT prediction (64.6%) tasks.

Algorithm 1 has two key components: a distance function for comparing the input samples and an optimal threshold value for filtering the multi-valued mappings.
Distance Function: Searching for multi-valued mappings in a dataset requires a metric to quantify the distance between the input samples. However, choosing the right distance function is not always obvious, and one needs to be careful about the implicit assumptions in any given metric. For example, cross-correlation, dynamic time warping (DTW) 76, and Euclidean distance are ways to measure the distance between two time series/waveforms, and each has specific characteristics: cross-correlation is phase invariant, DTW is scale invariant in the time dimension, and Euclidean distance is translation invariant. For cross-correlation, a sliding window dot-product of the two input data series is computed to find the point where the similarity is maximized; DTW computes an optimal match by reducing the minimum-edit distance between the two series; Euclidean distance measures the similarity between the two data series using the L2 distance.
Ideally, the distance function should align well with the task requirements. Among the three distance functions, DTW makes the similarity metric invariant with respect to the time scale. However, it is known that BP has a direct dependency on heart rate, which in turn is determined by the periodicity of the PPG waves. Thus, the time scale invariance property of DTW will result in information loss for this task, making it a bad choice as a distance function. The Euclidean distance used in isolation is not a good choice either, as even identical PPG signals slightly shifted in time can result in a high Euclidean distance value. Since the relationship between PPG and BP should not change with small shifts of the PPG signal forward or backward in time, such a distance metric is not suitable. Therefore, cross-correlation is well suited as the basis of a distance metric. Although a purely cross-correlation-based distance metric worked well in our experiments, we found it more logical to first align the PPG signals via cross-correlation and then compute the Euclidean distance between the aligned signals. We used this distance measure for all our experiments.
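The align-then-compare distance described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in our experiments: the function names `best_shift` and `aligned_distance`, the `max_shift` search window, and the use of circular shifts on equal-length windows are all assumptions made for the example.

```python
import math

def best_shift(a, b, max_shift):
    """Return the circular shift of b (in samples) whose dot product with a is largest."""
    n = len(a)

    def score(shift):
        # Sliding dot product between a and the shifted copy of b.
        return sum(a[i] * b[(i + shift) % n] for i in range(n))

    return max(range(-max_shift, max_shift + 1), key=score)

def aligned_distance(a, b, max_shift=10):
    """Align b to a via cross-correlation, then return the Euclidean (L2) distance."""
    n = len(a)
    shift = best_shift(a, b, max_shift)
    shifted = [b[(i + shift) % n] for i in range(n)]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, shifted)))

# Toy example (assumed data): one sine period and a copy delayed by 3 samples.
n = 50
a = [math.sin(2 * math.pi * i / n) for i in range(n)]
b = [a[(i - 3) % n] for i in range(n)]

plain = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # no alignment
aligned = aligned_distance(a, b)                            # with alignment
print(f"plain L2: {plain:.3f}, aligned L2: {aligned:.3f}")
```

In this toy case the plain Euclidean distance is clearly non-zero even though the two waveforms are identical up to a small time shift, whereas the aligned distance vanishes, which is the behaviour motivating the combined metric.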
Optimal Threshold: After choosing the appropriate distance function, we need to identify an optimal distance threshold, below which two signals can be considered "equal". However, it is not straightforward to find such a threshold. If the threshold is very generous (i.e., high), we will end up selecting distant input signals as equal, and obtain misleading multi-valued mappings. On the other hand, if the threshold is too strict (i.e., low), we may not find any multi-valued mappings even for ill-conditioned functions, as the chances of two input signals being identical, especially in the presence of noise, are very small. To identify the optimal threshold for filtering multi-valued mappings, we calculate the Euclidean distance between two consecutive aligned PPG waves, each 2 seconds in duration. This interval was chosen because it represents an ideal time frame in which the signal remains consistent. Ideally, the difference between two consecutive PPG waves should account for an irreducible error, and this can be used as a threshold for filtering multi-valued mappings. Figure 9 illustrates the results of this analysis, which indicates that a majority of the PPG waves exhibit a Euclidean distance of ≤ 1, which led us to choose 1.0 as the threshold for our experiment.
Note that our multi-valued mapping check is a one-way method, i.e., if we are able to find multi-valued mappings, it implies an ill-conditioned f; however, not finding multi-valued mappings does not guarantee the existence of a well-defined f. This is because Algorithm 1 may fail to find signals close in the input space due to sparsity of the dataset. The mutual information check discussed next provides a complementary method.
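The idea of searching for samples that are close in input space but far in output space can be illustrated with a brute-force sketch. It is not a reproduction of Algorithm 1: the function names, the `label_gap` parameter (how far apart two labels must be to count as "different"), the plain Euclidean default distance, and the way the flagged fraction is computed are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Plain L2 distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def multi_valued_pairs(inputs, labels, input_threshold, label_gap, distance=euclidean):
    """Return index pairs whose inputs are near-identical but whose labels differ widely."""
    pairs = []
    for i in range(len(inputs)):
        for j in range(i + 1, len(inputs)):
            close_inputs = distance(inputs[i], inputs[j]) <= input_threshold
            far_labels = abs(labels[i] - labels[j]) >= label_gap
            if close_inputs and far_labels:
                pairs.append((i, j))
    return pairs

def flagged_fraction(pairs, n_samples):
    """Fraction of samples that take part in at least one flagged pair (assumed definition)."""
    flagged = {index for pair in pairs for index in pair}
    return len(flagged) / n_samples

# Toy data (assumed): samples 0 and 1 look alike but carry very different labels.
inputs = [[0.0, 1.0, 0.0], [0.0, 1.1, 0.0], [5.0, 5.0, 5.0]]
labels = [110.0, 150.0, 120.0]  # e.g., SBP in mmHg

pairs = multi_valued_pairs(inputs, labels, input_threshold=1.0, label_gap=20.0)
fraction = flagged_fraction(pairs, len(inputs))
print(pairs, fraction)
```

The pairwise loop is quadratic in the number of samples, so a real dataset would need an indexed nearest-neighbour search; the sketch only conveys the one-way nature of the check, since an empty result on sparse data says nothing about whether f is well defined.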

B.2 Mutual information check
Mutual Information (MI) is an information theoretical measure of the dependence between two random variables X and Y, defined as:

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) \tag{1}$$

where H is the Shannon entropy function, $H(X) = -\sum_i p(x_i)\log p(x_i)$. For continuous analog data, it is computed via the limiting density of discrete points (LDDP) 77. The marginal entropies H(X) and H(Y) represent the amount of information needed to describe the outcome of the random variable. This is the same as the uncertainty of the random variable. H(X|Y) and H(Y|X) are conditional entropies, and denote the amount of information needed to describe the outcome of one random variable when the value of the other variable is known. This can also be thought of as the amount of uncertainty left in one random variable when the other is known. The mutual information I can then be interpreted as the amount of information (or reduction in uncertainty) that knowing one variable provides about the other. For example, I(X; Y) is zero if X and Y are independent, while it is maximum when X is a deterministic function of Y or vice-versa. Mutual information can be an effective measure in our case to evaluate whether the input signal (x) can be a good predictor of the output health label (y). However, since the computation of MI relies on estimation of probability density functions of the random variables, it is non-trivial to robustly estimate the MI for high dimensional data such as the time series PPG data. To overcome this curse of dimensionality, we recommend the following dimensionality reduction approaches before computing the MI.
Auto-Encoder. Since MI is invariant under smooth invertible transformations of the variables, we propose using an auto-encoder to aggressively reduce the input space dimensionality. We train an auto-encoder with the least number of bottleneck features needed to achieve a target mean-squared reconstruction loss of 0.1 on the normalized dataset. For the MIMIC and Aurora-BP datasets, we achieved this target with a bottleneck size of 20, at which the MI estimation worked robustly.
Hand-Crafted Features. As an alternative to using an auto-encoder, we can use hand-crafted features extracted from the input signal based on prior literature 63,64 and use these features for MI estimation. For example, in the task of BP prediction from the PPG signal, common features include normalized systolic slope, heart rate, heart rate variability, etc. The MI estimation process helps us understand the importance of each of these features both collectively and independently. Note that in the case of hand-crafted features, there is always the concern of completeness (i.e., whether the features extract enough of the information needed for the task from the input); thus, we recommend the auto-encoder approach whenever possible.
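The entropy-based interpretation of MI used in this section can be checked on small discrete examples. The sketch below is a plug-in (frequency-count) estimator for already-discretized variables, using the standard identity I(X;Y) = H(X) + H(Y) − H(X,Y), which is equivalent to H(X) − H(X|Y). It is an illustration only and not the estimator used for the continuous features in our experiments; the function names and the choice of base-2 logarithms (bits) are assumptions.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((count / n) * math.log2(count / n) for count in Counter(values).values())

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete samples."""
    joint = list(zip(xs, ys))
    return entropy(xs) + entropy(ys) - entropy(joint)

# Toy data (assumed): a fair binary X paired with two different Y variables.
xs = [0, 0, 1, 1]
ys_deterministic = [1, 1, 0, 0]  # Y is a deterministic function of X
ys_independent = [0, 1, 0, 1]    # Y carries no information about X

print(mutual_information(xs, ys_deterministic))
print(mutual_information(xs, ys_independent))
```

When Y is a deterministic function of X the estimate equals H(X), its maximum, and when the two are independent it drops to zero, matching the two limiting cases described above.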

C Analysis details

C.1 Datasets
Our work builds on two datasets, the properties of which are critical to understand the results of our work.
MIMIC II: The MIMIC II dataset contains records of continuous high-resolution physiological waveforms of patients in the ICU, such as ABP, PPG, and ECG, sampled at 125 Hz. The dataset consists of 67,830 records of varying duration from 30,000 patients 71. For the purpose of our study, we perform our analysis on a preprocessed subset of the MIMIC II dataset, consisting of 12,000 records from 942 patients 52. This subset is particularly useful for our analysis as it includes a sufficient number of patients for training and testing, compliant with AAMI standards, and has been commonly utilized in previous research (Table 1).
Aurora-BP: The Aurora-BP dataset 60 consists of 24,650 records from 483 subjects. Each subject has multiple records of varying duration, which were collected at rest or while performing activities such as exercise and brisk walking. The records are collected from multiple sensors/devices including optical PPG, EKG, tonometer, accelerometer, and cuff-based blood pressure.

C.2 Performance standards
To contextualize the performance of the SBP (Systolic BP) prediction task, two benchmarks have been widely used: the AAMI and BHS standards. The criteria of the AAMI (Association for the Advancement of Medical Instrumentation) standards 53 are that the test set should comprise at least 85 subjects, with at least 10% of them having an SBP above 180 mmHg and at least 10% having an SBP under 100 mmHg. For a test device to be compliant with the AAMI standards, the SBP prediction must have a bias under 5 mmHg and an error standard deviation (SD) under 8 mmHg on the test set. The BHS (British Hypertension Society) 54 standards state that the test set should consist of at least 85 subjects and that the cohort should be representative of the target audience of the device. The performance of the test device is divided into grades (Table 10). Additionally, the test data should cover the overall pressure range, specifically in these three ranges: ≤ 130, 130-160, and ≥ 160 mmHg.
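The AAMI error criteria stated above (bias under 5 mmHg and error SD under 8 mmHg) can be expressed as a short check. This is a minimal sketch: the function name `aami_error_check`, the use of the sample standard deviation, and the toy readings are assumptions, and the cohort requirements (number of subjects, SBP coverage) are not checked here.

```python
import statistics

def aami_error_check(predicted, reference, max_bias=5.0, max_sd=8.0):
    """Compare prediction errors (mmHg) against the AAMI bias and SD limits."""
    errors = [p - r for p, r in zip(predicted, reference)]
    bias = statistics.mean(errors)  # mean error
    sd = statistics.stdev(errors)   # sample SD of the errors (assumed choice)
    return {"bias": bias, "sd": sd, "passes": abs(bias) < max_bias and sd < max_sd}

# Toy readings (assumed), in mmHg.
reference = [120.0, 130.0, 110.0, 140.0]
good_predictions = [122.0, 127.0, 113.0, 138.0]    # small, unbiased errors
biased_predictions = [130.0, 140.0, 120.0, 150.0]  # constant +10 mmHg offset

good = aami_error_check(good_predictions, reference)
biased = aami_error_check(biased_predictions, reference)
print(good)
print(biased)
```

The second case shows why both quantities matter: a constant offset gives zero error SD yet still fails on bias.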

C.3 Other considerations
Dataset size: To understand the effect of data size on MI, and to verify whether our dataset had enough samples to enable robust MI estimation, we conducted the following experiment. We took a randomly selected slice of the data (ranging from 0.1% to 100% of the data) and computed the combined MI over 20 runs (this technique is known as bootstrapping). We performed this analysis for both the MIMIC and Aurora-BP datasets. As shown in Figures 10(A) and (B), although the estimates at smaller dataset sizes resulted in high variation, the variation bounds are very tight at larger sizes. This imparts confidence that our MI estimates over the full datasets are robust. Interestingly, we also found that using a smaller dataset can result in higher estimates of the MI values.
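The dataset-size experiment can be mimicked on synthetic data. The sketch below repeatedly draws random subsets of a given fraction and reports the mean and spread of a plug-in MI estimate over 20 runs. It is illustrative only: the discrete plug-in estimator, drawing subsets without replacement, the fixed seeds, and the synthetic independent variables are all assumptions, not a description of our pipeline.

```python
import math
import random
import statistics
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mi_stability(xs, ys, fraction, runs=20, seed=0):
    """Mean and spread of the MI estimate over `runs` random subsets of the data."""
    rng = random.Random(seed)
    k = max(2, int(len(xs) * fraction))
    estimates = []
    for _ in range(runs):
        idx = rng.sample(range(len(xs)), k)  # random slice, without replacement
        estimates.append(mutual_information([xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.mean(estimates), statistics.pstdev(estimates)

# Synthetic data (assumed): two independent variables with 8 levels each,
# so the true MI is zero and any positive estimate is estimator bias.
data_rng = random.Random(42)
xs = [data_rng.randrange(8) for _ in range(2000)]
ys = [data_rng.randrange(8) for _ in range(2000)]

mean_small, spread_small = mi_stability(xs, ys, fraction=0.02)
mean_full, spread_full = mi_stability(xs, ys, fraction=1.0)
print(f"2% of data:   mean={mean_small:.3f} bits, spread={spread_small:.3f}")
print(f"100% of data: mean={mean_full:.3f} bits, spread={spread_full:.3f}")
```

For this kind of estimator, small subsets typically give both a wider spread and an upward-biased MI estimate, which is consistent with the observation above that smaller datasets can yield higher MI values.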

Figure 1. When designing end-to-end machine learning models, researchers often use techniques such as: (A) providing the model with observations from similar patients, (B) constraining the task (e.g., limiting the distribution of labels), (C) calibrating models using data from a participant, or (D) preprocessing to filter out problematic samples (e.g., noisy inputs). When doing so, it can often be difficult to identify how these steps impact the integrity of a model.

Figure 2. Our reference network is used to evaluate the impact on performance due to the issues mentioned in Section "Review of the Results and Limitations of Prior Work". The network has 28M trainable parameters, takes a 3-channel input (PPG, VPG, APG), and outputs the SBP prediction. The model is optimized using a mean squared error loss.

Figure 3. Every participant (P) has multiple data records (R), and each record is divided into multiple overlapping windows (W). Each window forms a data sample. In No-Overlap, the train and test data are split at the participant level, while in Domain-Overlap, the split happens at the record level, and in Data-Overlap, the split happens at the window level.

Figure 4. The distribution of systolic BP values in the: (left) Aurora-BP dataset and (right) MIMIC dataset. In the MIMIC dataset, the SBP values lie in the range 65-200 mmHg; however, prior works ignore samples with SBP values outside the range of 75-165 mmHg.

Figure 5. (A) The offset calibration method's performance falls off quickly after the first few days. (B) Performance of our reference network with different auto-correlation thresholds on the MIMIC dataset.

Figure 9. The distribution of Euclidean distances between pairs of aligned consecutive PPG waves.

Table 2. Performance of PPG2ABP 21 on different test-train splits with varying degrees of dataset overlap. Even subtle leakages can result in large (but artificial) accuracy improvements.

Table 3. Performance of the reference network on different SBP ranges on the MIMIC dataset. Constraining the data range can result in significant (but artificial) accuracy improvements.

Table 4. Performance of different calibration methods on the MIMIC dataset. The incorrect Naïve calibration methods perform very well, underscoring problems with the evaluation methodology.

Table 9. Mutual information of auto-encoder features. The same trend as in Table 8 holds. While the SBP task has low Info-Fraction, the features encode much more information for the HR and RWAT tasks. Ground truth for RWAT is only available for the Aurora-BP dataset 60.

(2) Multiple modalities: Exploring the integration of PPG with other physiological signals holds immense potential for enhancing prediction accuracy and providing a more holistic view of cardiovascular health. (3) Improved Sensors: Advancements in sensor technology are crucial to capture higher-fidelity PPG data with minimal external noise and environmental variables. We believe that focusing on these critical areas will lead to generalizable and scalable solutions, empowering a future where everyone can benefit from the accessibility and convenience of non-invasive cuffless BP estimation.

Table 10. Grading scale of test devices as per the British Hypertension Society (BHS).